1. INTRODUCTION

Thalassemia is one of the main public health problems and is highly prevalent in the area extending
from sub-Saharan Africa, through the Mediterranean region and the Middle East, to the Indian
subcontinent and East and Southeast Asia [1], [2]. However, migration has spread thalassemia genes
throughout the world, including to Indonesia. About 7% of the world's population are carriers of
thalassemia, and about 50,000-100,000 children die of it [3]. In Indonesia, thalassemia is one of the
most common chronic diseases [4]. Currently, thalassemia ranks 5th among non-communicable diseases,
after heart disease, cancer, kidney disease, and stroke, with carriers making up 3.8% of the
Indonesian population. Based on data from the Indonesian Thalassemia Foundation, thalassemia cases
increased steadily from 2012 to 2018 [3].

Thalassemia is a genetic disease: a blood disorder inherited within families. The body of a
thalassemia sufferer makes an abnormal form or an inadequate amount of hemoglobin [1], [5].
Hemoglobin allows red blood cells to carry oxygen [6]. When there is not enough hemoglobin, the
body's red blood cells do not function properly and die more quickly. As a result, not enough oxygen
is delivered to the other cells of the body.

The cause of thalassemia is mutations in the DNA of cells that make hemoglobin [7]. Hemoglobin is
made of two different parts, called alpha and beta. Therefore, there are two types of thalassemia:
alpha-thalassemia and beta-thalassemia. According to [8], a newer, simplified classification is based
on the way of treatment, namely non-transfusion-dependent thalassemia (NTDT) and
transfusion-dependent thalassemia (TDT). Because the treatments differ, early detection of
thalassemia through screening is necessary to help thalassemia sufferers get the right treatment.
The aim is to increase their life expectancy and reduce the risk of passing thalassemia to the next
generation. Thus, it is important to obtain a precise thalassemia diagnosis.

Nowadays, in healthcare, it is important to invest in the development of computer technology to
enhance the processing of medical data [5]. Machine learning, one such technology, can help with
classification problems on large datasets. It plays an important role because it can be applied to
everyday data such as biomedical data. However, several challenges have emerged recently: the data
may come from multiple heterogeneous sources; the data may have a huge number of samples and require
a method for understanding a complex model; and the data may have few samples but be
high-dimensional and spatiotemporal. New developments in statistics and kernel methods are required
to meet these challenges [9].

Previous research has used several methods to classify thalassemia, such as fuzzy kernel robust
C-means, fuzzy C-means, and fuzzy kernel C-means [4], neural networks and genetic programming [10],
artificial intelligence algorithms [11], artificial neural networks [12], and naive Bayes [13]. In
addition, [12] and [14] used SVM, which showed good results with 93.2% accuracy and 100% AUC,
respectively.

This research used several kernel functions with the support vector machine (SVM) to classify
thalassemia. The kernel function is an essential component of SVM, and SVM can be modified with
various kernel functions to get a better result. Therefore, these kernel functions should be compared
for classifying thalassemia. Such a comparison will help medical staff overcome classification
problems. This research discusses three kernel functions: the linear kernel, the polynomial kernel,
and the Gaussian radial basis kernel. The aim is to find out which kernel function gives the highest
accuracy for classifying thalassemia with the SVM method.


2. RESEARCH METHOD 

The support vector machine (SVM) is a supervised machine learning method. The SVM algorithm was
originally proposed by Vapnik and Lerner [15], [16]. SVM can be applied to classification and
regression [17], [18]. It is claimed that SVM has high accuracy for classification [19]. The idea of
SVM is to map the input space to a higher-dimensional space. SVM constructs a hyperplane to separate
data into classes [20]. The selected hyperplanes are those that maximize the classification
margin [21].

Let $\{x_i, y_i\}_{i=1}^{N}$ be the dataset, where $x_i \in \mathbb{R}^n$ is the feature vector,
$y_i$ is the class label for $x_i$, and $N$ is the number of samples. To find the best hyperplane,
the main formula of support vector machines is:

$$f(x) = w \cdot x + b \tag{1}$$

This formula contains $w$ (weight), the vector orthogonal to the hyperplane that determines its
orientation; $b$ (bias), which sets the offset of the hyperplane from the origin; and $x$, the
training sample [22]. The aim is to maximize the margin.

Moreover, the goal of SVM is to construct two planes, $H_1$ and $H_2$, as in (2) and (3):

$$H_1: w^T x_i + b = +1 \quad \text{for } y_i = +1 \tag{2}$$

$$H_2: w^T x_i + b = -1 \quad \text{for } y_i = -1 \tag{3}$$

where the region for the positive class is $w^T x_i + b \ge +1$ and the region for the negative
class is $w^T x_i + b \le -1$. Figure 1 illustrates the hyperplane in SVM. The SVM optimization
problem can be written as:


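$$\text{Minimize } \frac{1}{2}\|w\|^2 \tag{4}$$

$$\text{s.t. } y_i (w^T x_i + b) \ge 1, \quad \forall i = 1, \ldots, N \tag{5}$$

By solving the problem above, $w$ and $b$ can be written as:

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i \tag{6}$$

$$b = \frac{1}{N_S} \sum_{i \in S} \Big( y_i - \sum_{m \in S} \alpha_m y_m \, x_m \cdot x_i \Big) \tag{7}$$

where $\alpha_i$ are the Lagrange multipliers, $S$ is the set of support vectors, and $N_S$ is the
number of support vectors.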


Then, the decision formula of SVM can be written as:

$$f(x) = \operatorname{sign}(w \cdot x + b) \tag{8}$$

Figure 1. Illustration of SVM [16]
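As a minimal sketch of the decision rule in (8), the snippet below evaluates $w \cdot x + b$ for a given hyperplane and returns the class label. The values of `w`, `b`, and the two points are hypothetical and chosen only for illustration. Mapping a score of exactly zero to +1 is an assumption made here, and `margin_width` uses the standard fact that the distance between $H_1$ and $H_2$ is $2 / \|w\|$.

```python
import math

def decision_function(w, b, x):
    """Return w . x + b, the signed score from equation (1)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return +1 or -1 following the sign rule in equation (8)."""
    # Assumption: a score of exactly 0 is assigned to the positive class.
    return 1 if decision_function(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the planes H1 and H2, which equals 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane and points, chosen only for illustration.
w = [1.0, 1.0]
b = -3.0
positive_point = [3.0, 2.0]   # w . x + b = 2, so class +1
negative_point = [0.5, 1.0]   # w . x + b = -1.5, so class -1

print(predict(w, b, positive_point))
print(predict(w, b, negative_point))
print(margin_width(w))
```

A smaller $\|w\|$ gives a wider margin, which is why the optimization problem minimizes $\|w\|^2$.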


SVM has several advantages, such as its capability to process large amounts of high-dimensional
data [23]. Also, SVM is easy to implement using linear boundaries, as shown in Figure 1. However,
there are classification problems in which a linear boundary cannot separate the classes [24].
Figure 2 shows such a case of non-linearly separable data. The best way to obtain a non-linear
decision boundary is to expand the original feature space. However, this can make computations
intractable because the original features are enlarged to a high-dimensional space. To tackle that
issue, we applied the 'kernel trick' using a kernel function.

Figure 2. Non-linearly separable data [25]
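The idea of expanding the feature space can be illustrated with a small sketch. The data below are hypothetical: one class lies on an inner ring and the other on an outer ring, so no straight line in the original plane separates them. Appending the squared radius as a third feature (an illustrative choice, not the mapping used in this research) makes the two classes separable by a plane.

```python
import math

def lift(point):
    """Append the squared radius x1^2 + x2^2 as a third feature."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

def ring(radius, count):
    """Return `count` points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / count),
             radius * math.sin(2 * math.pi * k / count))
            for k in range(count)]

# Hypothetical data: class -1 on an inner ring, class +1 on an outer ring.
inner = ring(1.0, 8)
outer = ring(3.0, 8)

# In the lifted space, the plane with w = (0, 0, 1) and b = -4 splits the rings.
w, b = (0.0, 0.0, 1.0), -4.0

def predict(point):
    """Classify a 2-D point with a linear rule applied in the lifted space."""
    z = lift(point)
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if score >= 0 else -1

inner_labels = [predict(p) for p in inner]
outer_labels = [predict(p) for p in outer]
print(inner_labels)
print(outer_labels)
```

The rule is linear in the lifted coordinates but corresponds to a circular boundary in the original plane.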


SVM classification performance depends closely on the kernel function [26]. Therefore, the kernel
function is the most essential component for achieving higher accuracy with the SVM method [27].
When a task is difficult in the original problem space, a kernel function helps to transform the
input space into another space where the task is easier [25]. In other words, a kernel function
transforms data into a higher-dimensional space [28], [29]. Its approach is to map data into a
kernel space where the data become linearly separable [26].


The kernel function can be written as:

$$K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle \tag{9}$$

For example, we construct the lifting map $\varphi: \mathcal{X} \to \mathcal{H}$ with
$\varphi: (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. This map lifts the data from
$\mathcal{X} = \mathbb{R}^2$ to $\mathcal{H} = \mathbb{R}^3$ [30]. Therefore, $\varphi$ maps data
from the input space to the feature space.
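The following sketch checks the kernel identity numerically, assuming the lifting map is the standard degree-2 map $(x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. For this map, the inner product in the lifted space equals the squared inner product in the input space, so the kernel value can be computed without ever forming the lifted vectors. The two input vectors are hypothetical.

```python
import math

def phi(x):
    """Lifting map from R^2 to R^3: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def kernel(x, z):
    """Squared inner product, computed without leaving R^2."""
    return dot(x, z) ** 2

# Hypothetical input vectors, for illustration only.
x = (1.0, 2.0)
z = (3.0, -1.0)

lifted = dot(phi(x), phi(z))   # inner product computed in R^3
direct = kernel(x, z)          # same quantity computed in R^2
print(lifted, direct)
```

The two printed values agree up to floating-point rounding, which is the essence of the kernel trick.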
The SVM optimization problem then becomes:

$$\text{Minimize } \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$$

$$\text{s.t. } y_i (w^T \varphi(x_i) + b) - 1 + \xi_i \ge 0, \quad \forall i = 1, \ldots, N$$

where $\xi_i$ is the slack variable, a measure of the misclassification error that should be
minimized, and $C$ is the penalty parameter that determines the trade-off between minimizing the
error and maximizing the classification margin.

By solving the problem above, $w$ and $b$ are given by (10) and (11):

$$w^* = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i) \tag{10}$$

$$b^* = \frac{1}{N_S} \sum_{i \in S} \Big( y_i - \sum_{m \in S} \alpha_m y_m \, \varphi(x_m) \cdot \varphi(x_i) \Big) \tag{11}$$

Then, the decision formula of SVM is given by (12):

$$f(x) = \operatorname{sign}(w^* \cdot \varphi(x) + b^*) \tag{12}$$
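To make the role of the slack variables and the penalty $C$ concrete, the sketch below computes, for a fixed hyperplane, the smallest slack that satisfies each constraint and the resulting objective value. It assumes an identity feature map and the conventional factor $\frac{1}{2}$ in front of $\|w\|^2$. The hyperplane and the three labelled samples are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y (w . x + b) - 1 + xi >= 0."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def soft_margin_objective(w, b, samples, C):
    """0.5 * ||w||^2 + C * (sum of slacks), assuming an identity feature map."""
    total_slack = sum(slack(w, b, x, y) for x, y in samples)
    return 0.5 * dot(w, w) + C * total_slack

# Hypothetical hyperplane and labelled samples (x, y), for illustration only.
w, b = (1.0, 0.0), 0.0
samples = [
    ((2.0, 0.0), +1),   # outside the margin: slack 0
    ((0.5, 0.0), +1),   # inside the margin:  slack 0.5
    ((1.0, 0.0), -1),   # misclassified:      slack 2
]

slacks = [slack(w, b, x, y) for x, y in samples]
print(slacks)
print(soft_margin_objective(w, b, samples, C=1.0))
print(soft_margin_objective(w, b, samples, C=10.0))
```

With a larger $C$, the same violations cost more, so the optimizer is pushed toward fewer training errors at the expense of a narrower margin.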

In this research, the authors applied three kernels to thalassemia classification:
a. Gaussian radial basis kernel

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \tag{13}$$

where $\sigma$ is the only parameter and defines the kernel width. It controls how near or far the
influence of a single training sample reaches. Also, $\sigma$ can be defined as the radius of
influence of the samples that affect the classification model. According to [16], a small $\sigma$
means the kernel is narrow, so the model focuses on a small set of data and the new hypersurface
will be spiky. This may lead to an overfitting problem. Conversely, a large $\sigma$ increases the
kernel width, so most of the data are transformed into a flat hyperspace, which leads to an
underfitting problem.
b. Polynomial kernel

$$K(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^d \tag{14}$$

where $d$ is the degree of the polynomial kernel function. According to [16], a high degree
increases the complexity of the classification model. This can lead to an overfitting problem, in
which the testing error increases while the training error decreases. Conversely, a small $d$ may
lead to high bias and low variance, that is, an underfitting problem.

c. Linear kernel

$$K(x_i, x_j) = x_i^T x_j \tag{15}$$

The linear kernel is the simplest kernel function; learning algorithms that use it are often
equivalent to SVM without a kernel function [16]. By comparing these kernels, we expect to find out
which kernel gives the highest accuracy. To calculate the accuracy, a confusion matrix is used. The
formulas for accuracy and the related metrics are:


$$\text{accuracy} = \frac{T_P + T_N}{T_P + T_N + F_P + F_N} \tag{16}$$

$$\text{precision} = \frac{T_P}{T_P + F_P} \tag{17}$$

$$\text{recall} = \frac{T_P}{T_P + F_N} \tag{18}$$

$$\text{F1 Score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{19}$$

$T_P$: Number of samples with thalassemia that were classified correctly.
$F_P$: Number of healthy people that were incorrectly classified as having thalassemia.
$F_N$: Number of samples with thalassemia that were incorrectly classified as healthy.
$T_N$: Number of healthy individuals that were classified correctly.
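The four metrics in (16)-(19) can be computed directly from the confusion-matrix counts, as in the minimal sketch below. The counts passed to the function are hypothetical and are not taken from the thalassemia experiments. The function assumes that at least one sample is predicted positive and at least one is actually positive, so that no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, not taken from the thalassemia experiments.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In a screening setting, recall is the share of thalassemia samples that the model detects, while precision is the share of positive predictions that are correct.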


3. RESULTS AND DISCUSSION 

In this paper, the thalassemia data were obtained from Harapan Kita Children and Women's Hospital,
Indonesia, and consist of 150 samples. The dataset is represented by 10 variables: Hemoglobin
(g/dL), Haematocrit Percent (%), Leukocyte Count (10^3/µL), Basophils Percent (%), Eosinophils
Percent (%), Rod Neutrophils Percent (%), Segment Neutrophils Percent (%), Lymphocytes Percent (%),
Monocytes Percent (%), and Platelet Counts (10^3/µL). The authors used the Shapiro-Wilk algorithm
to assess the normality of the distribution of instances with respect to each feature. A bar plot,
shown in Figure 3, was then drawn to show the relative rank of each feature. Platelet Counts has the
highest ranking.

Figure 3. Ranking of thalassemia data features with the Shapiro-Wilk algorithm


This research varied the proportion of training data from 10% to 90% and used $\sigma = 0.1$ for
the Gaussian RBF kernel and $d = 3$ for the polynomial kernel. These values were chosen because,
across the experiments performed, $\sigma = 0.1$ and $d = 3$ gave the best performance. The choice
of $\sigma = 0.1$ is also supported by [16].
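As an illustration of the kernel definitions in (13)-(15) with the parameter values above, the sketch below evaluates the three kernels on a pair of feature vectors. The vectors are hypothetical, and the factor 2 in the Gaussian denominator is an assumed normalization. The last two lines contrast a narrow and a wide Gaussian kernel on the same pair of points.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(x, z):
    """Plain inner product x^T z."""
    return dot(x, z)

def polynomial_kernel(x, z, d=3):
    """(<x, z> + 1)^d with degree d."""
    return (dot(x, z) + 1.0) ** d

def gaussian_rbf_kernel(x, z, sigma=0.1):
    """exp(-||x - z||^2 / (2 sigma^2)); the factor 2 is an assumed convention."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# Two hypothetical feature vectors, for illustration only.
x = (0.2, 0.4)
z = (0.3, 0.1)

print(linear_kernel(x, z))
print(polynomial_kernel(x, z, d=3))
print(gaussian_rbf_kernel(x, z, sigma=0.1))   # narrow kernel: value near 0
print(gaussian_rbf_kernel(x, z, sigma=10.0))  # wide kernel: value near 1
```

With a small $\sigma$, only very close samples have a kernel value far from zero, which is the narrow, spiky behavior associated with overfitting. With a large $\sigma$, almost all pairs look alike, which is associated with underfitting.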

As shown in Table 1, the SVM model with a Gaussian radial basis function kernel produces the best
accuracy for classifying thalassemia data, with an average accuracy of 99.63%. The second best is
the linear kernel with 98.23% accuracy, and the last is the polynomial kernel with 97.94% accuracy.
The linear kernel reaches its best accuracy of 100% with 10% and 30% training data. The polynomial
kernel reaches 100% with 10-30% and 50% training data, and the Gaussian radial basis function kernel
reaches 100% with 10-50%, 70%, and 80% training data. For F1 score, the Gaussian radial basis kernel
is still the best. In Table 2, the Gaussian radial basis kernel gives the best performance, with an
average precision of 99.56% and an average recall of 99.78%. However, second place differs between
precision and recall: the linear kernel is second for precision, while the polynomial kernel is
second for recall.


Table 1. The accuracy and F1 score (%) of SVM with kernel function

Training           Accuracy                         F1 Score
data       Linear  Polynomial  Gaussian    Linear  Polynomial  Gaussian
10%        100.00      100.00    100.00    100.00      100.00    100.00
20%         96.67      100.00    100.00     96.00      100.00    100.00
30%        100.00      100.00    100.00    100.00      100.00    100.00
40%         98.00       96.67    100.00     98.00       96.00    100.00
50%         98.67      100.00    100.00     99.00      100.00    100.00
60%         98.89       97.77     98.89     99.00       98.00     99.00
70%         97.14       96.19    100.00     98.00       96.00    100.00
80%         99.16       97.50    100.00     98.00      100.00    100.00
90%         95.56       93.33     97.78     97.00       93.00     99.00
Average     98.23       97.94     99.63     98.33       98.11     99.78
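The average accuracies in Table 1 can be reproduced from the per-proportion values, as in the short check below. The numbers are copied from the accuracy columns of Table 1 in order of training proportion (10% to 90%). The dictionary keys and the helper `mean` are names introduced here for illustration.

```python
# Accuracy (%) per training proportion (10% to 90%), copied from Table 1.
accuracy = {
    "linear":     [100.00, 96.67, 100.00, 98.00, 98.67, 98.89, 97.14, 99.16, 95.56],
    "polynomial": [100.00, 100.00, 100.00, 96.67, 100.00, 97.77, 96.19, 97.50, 93.33],
    "gaussian":   [100.00, 100.00, 100.00, 100.00, 100.00, 98.89, 100.00, 100.00, 97.78],
}

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Round to two decimals, as reported in the table.
averages = {kernel: round(mean(values), 2) for kernel, values in accuracy.items()}
best_kernel = max(averages, key=averages.get)

print(averages)
print(best_kernel)
```

The same check can be applied to the F1 score columns and to the precision and recall columns of Table 2.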


Table 2. The precision and recall (%) of SVM with kernel function

Training           Precision                        Recall
data       Linear  Polynomial  Gaussian    Linear  Polynomial  Gaussian
10%        100.00      100.00    100.00    100.00      100.00    100.00
20%        100.00      100.00    100.00     93.00      100.00    100.00
30%        100.00      100.00    100.00    100.00      100.00    100.00
40%        100.00      100.00    100.00     95.00       94.00    100.00
50%         98.00      100.00    100.00     98.00      100.00    100.00
60%        100.00      100.00    100.00     98.00       96.00     98.00
70%         95.00       95.00    100.00     98.00       98.00    100.00
80%        100.00      100.00    100.00     94.00       96.00    100.00
90%        100.00       97.00     96.00     95.00       92.00    100.00
Average     99.22       99.11     99.56     96.78       97.33     99.78


The authors also tried other machine learning methods, namely KNN with k=7 and random forest. KNN
gave 90% accuracy and random forest gave 100% accuracy. Nevertheless, SVM with some of the kernel
functions still gives the highest accuracy, 100%, so it can be said that SVM is the best-performing
machine learning method for classifying thalassemia.


4. CONCLUSION 

Machine learning can help medical staff classify thalassemia precisely. With early detection,
patients can get the right treatment, which helps increase their life expectancy and reduce the risk
of passing thalassemia to the next generation. In this research, three kernel functions were used
in SVM: the linear, polynomial, and Gaussian radial basis function kernels. A kernel function helps
SVM transform the input space into a higher-dimensional space where the problem is easier to solve.

From this research, the support vector machine with the Gaussian RBF kernel is the best for
classifying thalassemia data from Harapan Kita Children and Women's Hospital, Indonesia. As Table 1
shows, each kernel reaches the highest accuracy of 100% for some training proportions. However, on
average accuracy, the Gaussian RBF kernel is the best with 99.63%. The second best is the linear
kernel with 98.23% accuracy, and the last is the polynomial kernel with 97.94% accuracy. Besides
that, the Gaussian radial basis kernel also gives the highest average F1 score, 99.78%. Also, in
Table 2, the Gaussian RBF kernel has the highest average precision and recall, with 99.56% and
99.78% respectively. For future research, using a larger dataset is recommended to generate higher
accuracies in each method. Also, we believe that future research can develop this method to give
the best accuracy for predicting or classifying other diseases.